This dataset consists of 2,240 customers of supermarket XYZ, with data on customer profiles, consumption habits, campaign performance and channel preferences.
It was obtained from the Kaggle website. Refer to the following link for the source: https://www.kaggle.com/datasets/jackdaoud/marketing-data?resource=download
Supermarket XYZ wants to understand the impact of the marketing campaigns it has been running and optimize the selection of future ones. To that end, it wants to predict whether a customer will accept the next campaign, in order to save the money otherwise spent on unprofitable marketing actions.
The supermarket also wants to understand its customers better, so it can address their needs and offer each of them relevant products. This requires precise profiling of each customer segment. Additionally, management wants to derive insights about the consumption habits of each customer segment.
An example of what the data looks like:
head(dataset)
#> ï..ID Year_Birth Education Marital_Status Income Kidhome Teenhome
#> 1 5524 1957 Graduation Single 58138 0 0
#> 2 2174 1954 Graduation Single 46344 1 1
#> 3 4141 1965 Graduation Together 71613 0 0
#> 4 6182 1984 Graduation Together 26646 1 0
#> 5 5324 1981 PhD Married 58293 1 0
#> 6 7446 1967 Master Together 62513 0 1
#> Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
#> 1 2012-09-04 58 635 88 546 172
#> 2 2014-03-08 38 11 1 6 2
#> 3 2013-08-21 26 426 49 127 111
#> 4 2014-02-10 26 11 4 20 10
#> 5 2014-01-19 94 173 43 118 46
#> 6 2013-09-09 16 520 42 98 0
#> MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
#> 1 88 88 3 8
#> 2 1 6 2 1
#> 3 21 42 1 8
#> 4 3 5 2 2
#> 5 27 15 5 5
#> 6 42 14 2 6
#> NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3
#> 1 10 4 7 0
#> 2 1 2 5 0
#> 3 2 10 4 0
#> 4 0 4 6 0
#> 5 3 6 5 0
#> 6 4 10 6 0
#> AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact
#> 1 0 0 0 0 0 3
#> 2 0 0 0 0 0 3
#> 3 0 0 0 0 0 3
#> 4 0 0 0 0 0 3
#> 5 0 0 0 0 0 3
#> 6 0 0 0 0 0 3
#> Z_Revenue Response
#> 1 11 1
#> 2 11 0
#> 3 11 0
#> 4 11 0
#> 5 11 0
#> 6 11 0
Most variables are int (integer); three are chr (character), of which two are categorical and one is a date.
str(dataset)
#> 'data.frame': 2240 obs. of 29 variables:
#> $ ï..ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
#> $ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
#> $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
#> $ Marital_Status : chr "Single" "Single" "Together" "Together" ...
#> $ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
#> $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
#> $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
#> $ Dt_Customer : chr "2012-09-04" "2014-03-08" "2013-08-21" "2014-02-10" ...
#> $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
#> $ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
#> $ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
#> $ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
#> $ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
#> $ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
#> $ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
#> $ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
#> $ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
#> $ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
#> $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
#> $ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
#> $ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
#> $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
#> $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
#> $ Response : int 1 0 0 0 0 0 0 0 1 0 ...
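The date column Dt_Customer is stored as character. As a minimal sketch (assuming the ISO format shown in the output above), it could be parsed as a proper Date, should customer tenure ever be needed as a feature:

```r
# Parse Dt_Customer ("YYYY-MM-DD" character strings) into Date objects
dataset$Dt_Customer <- as.Date(dataset$Dt_Customer, format = "%Y-%m-%d")
```

This is optional here, since the column is dropped before modelling.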
Check for NA values:
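A minimal sketch for locating missing values (column-wise NA counts):

```r
# Count NA values per column and show only the columns that have any
na_counts <- colSums(is.na(dataset))
na_counts[na_counts > 0]
```

The summary below confirms Income is the only column with missing values (24 NAs).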
Check for outliers:
summary(dataset)
#> ï..ID Year_Birth Education Marital_Status
#> Min. : 0 Min. :1893 Length:2240 Length:2240
#> 1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
#> Median : 5458 Median :1970 Mode :character Mode :character
#> Mean : 5592 Mean :1969
#> 3rd Qu.: 8428 3rd Qu.:1977
#> Max. :11191 Max. :1996
#>
#> Income Kidhome Teenhome Dt_Customer
#> Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2240
#> 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
#> Median : 51382 Median :0.0000 Median :0.0000 Mode :character
#> Mean : 52247 Mean :0.4442 Mean :0.5062
#> 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
#> Max. :666666 Max. :2.0000 Max. :2.0000
#> NA's :24
#> Recency MntWines MntFruits MntMeatProducts
#> Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
#> 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
#> Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
#> Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
#> 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
#> Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
#>
#> MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
#> Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
#> 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
#> Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
#> Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
#> 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
#> Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
#>
#> NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
#> 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
#> Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
#> Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
#> 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
#> Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
#>
#> AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
#> Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
#> 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
#> Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
#> Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
#> 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
#> Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
#>
#> AcceptedCmp2 Complain Z_CostContact Z_Revenue
#> Min. :0.00000 Min. :0.000000 Min. :3 Min. :11
#> 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:3 1st Qu.:11
#> Median :0.00000 Median :0.000000 Median :3 Median :11
#> Mean :0.01339 Mean :0.009375 Mean :3 Mean :11
#> 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:3 3rd Qu.:11
#> Max. :1.00000 Max. :1.000000 Max. :3 Max. :11
#>
#> Response
#> Min. :0.0000
#> 1st Qu.:0.0000
#> Median :0.0000
#> Mean :0.1491
#> 3rd Qu.:0.0000
#> Max. :1.0000
#>
ds2 <- select(dataset, -c(ï..ID, Dt_Customer, Z_CostContact, Z_Revenue))
column_names <- names(ds2)
bar_plots_list <- list()
# Iterate over each column and create a barplot using ggplot
for (col in column_names) {
p <- ggplot(ds2, aes(x = .data[[col]])) +
geom_bar() +
labs(title = col)
bar_plots_list[[col]] <- p
}
num_cols_per_page <- 2
num_plots <- length(column_names)
for (i in seq(1, num_plots, by = num_cols_per_page)) {
end <- min(i + num_cols_per_page - 1, num_plots)
plots_subset <- bar_plots_list[i:end]
grid.arrange(grobs = plots_subset, ncol = num_cols_per_page)
}
column_names_box_plot <- c("Year_Birth","Income","Recency","MntWines","MntFruits","MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds", "NumDealsPurchases","NumWebPurchases","NumCatalogPurchases","NumStorePurchases","NumWebVisitsMonth")
box_plots_list <- list()
# Iterate over each column and create a boxplot using ggplot
for (col in column_names_box_plot) {
p <- ggplot(ds2, aes(y = .data[[col]])) +
geom_boxplot() +
labs(title = col)
box_plots_list[[col]] <- p
}
num_plots_box <- length(column_names_box_plot)
for (i in seq(1, num_plots_box, by = num_cols_per_page)) {
end <- min(i + num_cols_per_page - 1, num_plots_box)
plots_subset <- box_plots_list[i:end]
grid.arrange(grobs = plots_subset, ncol = num_cols_per_page)
}
The first step is removing outliers using the interquartile range (IQR) for the following variables:
• “Year_Birth”: there were customers older than 125 years.
• “Income”: there are some extremely high income values.
The second step is encoding some of the categorical variables for inclusion in the model:
• The “Marital_Status” attribute can be narrowed to two categories: 0 (Single) or 1 (Together).
• The “Education” attribute takes the values 2n Cycle, Basic, Graduation, Master and PhD. We recoded these into 3 groups: Basic (0), Graduation (1), and 2n Cycle/Master/PhD (2). With this recoding the education levels can be tracked and ordered (0, 1, 2).
#Remove outliers
# Year_birth
tquantile <- quantile(ds2$Year_Birth, probs=c(.25, .75), na.rm = FALSE)
tiqr<- IQR(ds2$Year_Birth)
tlower <- tquantile[1] - 1.5*tiqr
tupper <- tquantile[2] + 1.5*tiqr
ds2 <- subset(ds2, ds2$Year_Birth > tlower & ds2$Year_Birth < tupper)
# Income
# Drop the last three digits (and any punctuation) so Income is expressed in thousands
ds2$Income <- gsub('.{3}$', '', as.character(ds2$Income))
ds2$Income <- gsub('[[:punct:]]', '', as.character(ds2$Income))
ds2$Income <- as.numeric(ds2$Income)
tquantile <- quantile(ds2$Income, probs=c(.25, .75), na.rm = TRUE)
tiqr<- IQR(ds2$Income, na.rm=TRUE)
tlower <- tquantile[1] - 1.5*tiqr
tupper <- tquantile[2] + 1.5*tiqr
ds2 <- subset(ds2, ds2$Income > tlower & ds2$Income < tupper)
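The two IQR filters above repeat the same logic; as a sketch, it could be factored into a helper (the function name is illustrative, not from the original analysis):

```r
# Hedged refactor of the IQR filtering applied above; `remove_iqr_outliers`
# is an illustrative name. k = 1.5 reproduces the 1.5*IQR fences used above.
remove_iqr_outliers <- function(df, col, k = 1.5) {
  q <- quantile(df[[col]], probs = c(.25, .75), na.rm = TRUE)
  iqr <- IQR(df[[col]], na.rm = TRUE)
  subset(df, df[[col]] > q[1] - k * iqr & df[[col]] < q[2] + k * iqr)
}
# Equivalent to the two blocks above:
# ds2 <- remove_iqr_outliers(ds2, "Year_Birth")
# ds2 <- remove_iqr_outliers(ds2, "Income")
```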
# Recoding
#Marital status
ds2$Marital_Status <- recode(ds2$Marital_Status, Divorced = 0, Alone = 0, YOLO = 0, Absurd = 0, Single = 0, Widow = 0, Married = 1, Together = 1)
#Education
ds2$Education<- recode(ds2$Education, "2n Cycle" = 2, Basic = 0, Graduation = 1, Master= 2, PhD = 2)
ds3 <- ds2
summary(ds3)
#> Year_Birth Education Marital_Status Income
#> Min. :1940 Min. :0.000 Min. :0.0000 Min. : 1.00
#> 1st Qu.:1959 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 35.00
#> Median :1970 Median :1.000 Median :1.0000 Median : 51.00
#> Mean :1969 Mean :1.446 Mean :0.6449 Mean : 51.12
#> 3rd Qu.:1977 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 68.00
#> Max. :1996 Max. :2.000 Max. :1.0000 Max. :113.00
#> Kidhome Teenhome Recency MntWines
#> Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
#> 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00 1st Qu.: 24.0
#> Median :0.0000 Median :0.0000 Median :49.00 Median : 178.0
#> Mean :0.4422 Mean :0.5066 Mean :49.01 Mean : 306.2
#> 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00 3rd Qu.: 507.0
#> Max. :2.0000 Max. :2.0000 Max. :99.00 Max. :1493.0
#> MntFruits MntMeatProducts MntFishProducts MntSweetProducts
#> Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
#> 1st Qu.: 2.0 1st Qu.: 16.0 1st Qu.: 3.00 1st Qu.: 1.00
#> Median : 8.0 Median : 68.0 Median : 12.00 Median : 8.00
#> Mean : 26.4 Mean : 165.3 Mean : 37.76 Mean : 27.13
#> 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00 3rd Qu.: 34.00
#> Max. :199.0 Max. :1725.0 Max. :259.00 Max. :262.00
#> MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
#> Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000
#> Median : 25.00 Median : 2.000 Median : 4.000 Median : 2.000
#> Mean : 44.06 Mean : 2.318 Mean : 4.101 Mean : 2.645
#> 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000
#> Max. :321.00 Max. :15.000 Max. :27.000 Max. :28.000
#> NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4
#> Min. : 0.000 Min. : 0.000 Min. :0.00000 Min. :0.00000
#> 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:0.00000
#> Median : 5.000 Median : 6.000 Median :0.00000 Median :0.00000
#> Mean : 5.824 Mean : 5.337 Mean :0.07392 Mean :0.07438
#> 3rd Qu.: 8.000 3rd Qu.: 7.000 3rd Qu.:0.00000 3rd Qu.:0.00000
#> Max. :13.000 Max. :20.000 Max. :1.00000 Max. :1.00000
#> AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain
#> Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.00000
#> 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
#> Median :0.00000 Median :0.0000 Median :0.00000 Median :0.00000
#> Mean :0.07302 Mean :0.0644 Mean :0.01361 Mean :0.00907
#> 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
#> Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000
#> Response
#> Min. :0.000
#> 1st Qu.:0.000
#> Median :0.000
#> Mean :0.151
#> 3rd Qu.:0.000
#> Max. :1.000
It is interesting to examine the correlation matrix: “Income” shows high positive correlations with “MntWines”, “MntMeatProducts”, “NumCatalogPurchases” and “NumStorePurchases”, and high negative correlations with “Kidhome” and “NumWebVisitsMonth”.
res <- cor(ds3)
corrplot(res, method="number", title= "Correlation matrix",
tl.cex = 0.7, tl.srt = 45, tl.col = "black",
cl.cex = 0.8, cl.ratio = 0.2, cl.align = "r",
addCoef.col = "black", number.digits = 2,number.cex = 0.5,
mar = c(0, 0, 2, 0))
Let us take a look at some scatter plots of Income against the other variables.
##Scatter plots
column_names_scatter_plot <- c("Year_Birth", "Education","Kidhome","Recency","MntWines","MntFruits","MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds", "NumDealsPurchases","NumWebPurchases","NumCatalogPurchases","NumStorePurchases","NumWebVisitsMonth")
scatter_plots_list <- list()
# Iterate over each column and create a scatter plot using ggplot
for (col in column_names_scatter_plot) {
p <- ggplot(ds3, aes(x = .data[[col]], y = Income)) +
geom_point() +
labs(title = col)
scatter_plots_list[[col]] <- p
}
num_plots_scatter <- length(column_names_scatter_plot)
for (i in seq(1, num_plots_scatter, by = num_cols_per_page)) {
end <- min(i + num_cols_per_page - 1, num_plots_scatter)
plots_subset <- scatter_plots_list[i:end]
grid.arrange(grobs = plots_subset, ncol = num_cols_per_page)
}
To train the models we first prepare the data in the following steps:
###Divide train & test
set.seed(123) # For reproducibility
# Generate indices for train and test sets
train_indices <- sample(seq_len(nrow(ds3)), size = floor(0.7 * nrow(ds3)), replace = FALSE)
test_indices <- setdiff(seq_len(nrow(ds3)), train_indices)
# Create train and test datasets
train <- ds3[train_indices, ]
test <- ds3[test_indices, ]
#proportion of response variable
prop.table(table(train$Response))*100 # display the ratio
#>
#> 0 1
#> 85.15878 14.84122
Oversampling is applied to balance the classes to approximately 50% each.
train_balanced <- ovun.sample(Response ~ ., data = train, method = "over")$data
prop.table(table(train_balanced$Response))*100 # display the ratio
#>
#> 0 1
#> 50.13354 49.86646
Running the Naive Bayes classifier yields 76% accuracy, 57% recall and 35% precision.
nb_model <- naiveBayes(Response ~ ., data = train_balanced)
predictions <- predict(nb_model, newdata = test, type = "class")
cm_nb <- confusionMatrix(predictions, factor(test$Response), positive = "1")
print(cm_nb)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 449 44
#> 1 109 60
#>
#> Accuracy : 0.7689
#> 95% CI : (0.7349, 0.8005)
#> No Information Rate : 0.8429
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3042
#>
#> Mcnemar's Test P-Value : 2.29e-07
#>
#> Sensitivity : 0.57692
#> Specificity : 0.80466
#> Pos Pred Value : 0.35503
#> Neg Pred Value : 0.91075
#> Prevalence : 0.15710
#> Detection Rate : 0.09063
#> Detection Prevalence : 0.25529
#> Balanced Accuracy : 0.69079
#>
#> 'Positive' Class : 1
#>
The confusion matrix for the logistic regression shows 80% accuracy, 77% recall and 43% precision, slightly better than the Naive Bayes model.
Some of the most important variables are Recency, NumStorePurchases, Marital_Status and Education.
#fit logistic regression model
#disable scientific notation for model summary
options(scipen=999)
model <- glm(Response ~.,family="binomial", data=train_balanced)
#view model summary
summary(model)
#>
#> Call:
#> glm(formula = Response ~ ., family = "binomial", data = train_balanced)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.8045 -0.6686 -0.1009 0.6813 2.6643
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 21.0214551 10.3529623 2.030 0.042308 *
#> Year_Birth -0.0120069 0.0052405 -2.291 0.021952 *
#> Education 0.9531867 0.1072961 8.884 < 0.0000000000000002 ***
#> Marital_Status -1.0526061 0.1097368 -9.592 < 0.0000000000000002 ***
#> Income -0.0043446 0.0064948 -0.669 0.503532
#> Kidhome -0.0861538 0.1481035 -0.582 0.560760
#> Teenhome -1.0066680 0.1454380 -6.922 0.000000000004464730 ***
#> Recency -0.0227471 0.0018866 -12.057 < 0.0000000000000002 ***
#> MntWines -0.0001223 0.0002804 -0.436 0.662704
#> MntFruits 0.0054039 0.0019625 2.754 0.005894 **
#> MntMeatProducts 0.0032034 0.0004016 7.976 0.000000000000001512 ***
#> MntFishProducts 0.0005507 0.0014396 0.383 0.702083
#> MntSweetProducts 0.0043246 0.0017831 2.425 0.015294 *
#> MntGoldProds -0.0002267 0.0012520 -0.181 0.856320
#> NumDealsPurchases 0.2385229 0.0393975 6.054 0.000000001410592464 ***
#> NumWebPurchases 0.0990963 0.0307425 3.223 0.001267 **
#> NumCatalogPurchases 0.1933626 0.0322909 5.988 0.000000002122549355 ***
#> NumStorePurchases -0.2635514 0.0268754 -9.806 < 0.0000000000000002 ***
#> NumWebVisitsMonth 0.3413908 0.0426229 8.010 0.000000000000001151 ***
#> AcceptedCmp3 1.2104734 0.1849638 6.544 0.000000000059742433 ***
#> AcceptedCmp4 0.8845757 0.2300388 3.845 0.000120 ***
#> AcceptedCmp5 1.9456921 0.2393113 8.130 0.000000000000000428 ***
#> AcceptedCmp1 0.9050976 0.2383136 3.798 0.000146 ***
#> AcceptedCmp2 1.5188642 0.4809556 3.158 0.001588 **
#> Complain -2.0488798 1.0176357 -2.013 0.044075 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 3633.5 on 2620 degrees of freedom
#> Residual deviance: 2260.1 on 2596 degrees of freedom
#> AIC: 2310.1
#>
#> Number of Fisher Scoring iterations: 5
probabilities <- model %>% predict(test, type = "response")
predicted_classes <- ifelse(probabilities > 0.5, "1", "0")
cm_lr1 <-confusionMatrix(as.factor(predicted_classes),as.factor(test$Response), positive = "1")
print(cm_lr1)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 453 23
#> 1 105 81
#>
#> Accuracy : 0.8066
#> 95% CI : (0.7745, 0.8361)
#> No Information Rate : 0.8429
#> P-Value [Acc > NIR] : 0.9947
#>
#> Kappa : 0.4472
#>
#> Mcnemar's Test P-Value : 0.00000000000081
#>
#> Sensitivity : 0.7788
#> Specificity : 0.8118
#> Pos Pred Value : 0.4355
#> Neg Pred Value : 0.9517
#> Prevalence : 0.1571
#> Detection Rate : 0.1224
#> Detection Prevalence : 0.2810
#> Balanced Accuracy : 0.7953
#>
#> 'Positive' Class : 1
#>
# 80% accuracy
# 77% Recall
# 43% Precision
# 55% F1
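As a cross-check, the F1 reported above is the harmonic mean of precision and recall, and can be recomputed from the caret confusion-matrix object already in the workspace:

```r
# F1 = harmonic mean of precision and recall, taken from cm_lr1 above
prec <- cm_lr1$byClass["Precision"]
rec  <- cm_lr1$byClass["Recall"]
unname(2 * prec * rec / (prec + rec))  # matches cm_lr1$byClass["F1"]
```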
#Variable importance
vlr <- varImp(model)
VI_lr<- data.frame(var=names(train_balanced[,-25]), imp=vlr)
VI_plot_lr <- VI_lr[order(VI_lr$Overall,decreasing=FALSE),]
barplot(VI_plot_lr$Overall,
names.arg=rownames(VI_plot_lr),
horiz=TRUE,
col='steelblue',
xlab='Variable Importance',
main="Variable importance logistic regression",
las = 2,
cex.names = 0.65)
It was observed in the correlation matrix that many variables are highly correlated, so we proceed to apply Principal Component Analysis (PCA). This technique reduces the dimensionality of the data while capturing the most important patterns.
###PCA
# note: train_balanced still contains the Response column here, so it enters the PCA as well
d_pca <- prcomp(train_balanced, center = TRUE, scale. = TRUE)
pca_var<- d_pca$sdev^2
pve <- pca_var/sum(pca_var)
The variability explained can be observed in the cumulative-sum plot: 80% of the variability is explained by the first 12 principal components, which are used to fit the logistic regression.
plot(cumsum(pve), xlab="Principal Component",
ylab="Proportion of variation explained",
ylim=c(0,1),
type="b",
main= "CUMSUM Scree Plot",
)
abline(h=0.8, col="red")
pcadata <- data.frame(Response = train_balanced[,"Response"],d_pca$x[,1:12])
After applying PCA to the training set, the logistic regression was run again, obtaining 96.8% accuracy, 97% recall and 84% precision.
This model performs well, but it is not an explanatory model: the components created by PCA cannot be interpreted directly. It is therefore worth also fitting explainable models, since knowing which variables matter most is relevant.
model3<- glm(Response ~. ,family="binomial", data=pcadata)
#view model summary
summary(model3)
#>
#> Call:
#> glm(formula = Response ~ ., family = "binomial", data = pcadata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -5.1897 -0.0186 -0.0001 0.0363 2.6581
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.47520 0.17294 -2.748 0.005999 **
#> PC1 1.29896 0.09996 12.995 < 0.0000000000000002 ***
#> PC2 0.14285 0.09420 1.516 0.129393
#> PC3 -3.86215 0.28366 -13.615 < 0.0000000000000002 ***
#> PC4 4.55588 0.32293 14.108 < 0.0000000000000002 ***
#> PC5 1.22139 0.16646 7.337 0.000000000000218 ***
#> PC6 -1.50029 0.16522 -9.081 < 0.0000000000000002 ***
#> PC7 -0.66438 0.17870 -3.718 0.000201 ***
#> PC8 -0.36397 0.16531 -2.202 0.027684 *
#> PC9 0.23314 0.18109 1.287 0.197940
#> PC10 -3.03691 0.28000 -10.846 < 0.0000000000000002 ***
#> PC11 -0.34858 0.19699 -1.770 0.076801 .
#> PC12 -1.58912 0.22844 -6.956 0.000000000003489 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 3633.46 on 2620 degrees of freedom
#> Residual deviance: 317.89 on 2608 degrees of freedom
#> AIC: 343.89
#>
#> Number of Fisher Scoring iterations: 9
# preprocess the test data as it was done with train applying pca
test.p <- predict(d_pca, newdata = test[,1:25])
predict_pca <- predict(model3, newdata=data.frame(test.p[,1:12]), type="response")
predicted_classes3 <- factor(ifelse(predict_pca >= 0.5, "1", "0"))
#confusion matrix
cm_pca <- confusionMatrix(as.factor(predicted_classes3),as.factor(test$Response), positive = "1")
print(cm_pca)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 540 3
#> 1 18 101
#>
#> Accuracy : 0.9683
#> 95% CI : (0.9519, 0.9803)
#> No Information Rate : 0.8429
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.8869
#>
#> Mcnemar's Test P-Value : 0.00225
#>
#> Sensitivity : 0.9712
#> Specificity : 0.9677
#> Pos Pred Value : 0.8487
#> Neg Pred Value : 0.9945
#> Prevalence : 0.1571
#> Detection Rate : 0.1526
#> Detection Prevalence : 0.1798
#> Balanced Accuracy : 0.9694
#>
#> 'Positive' Class : 1
#>
# 96.8% accuracy
# 97% Recall
# 84% Precision
# 90% F1
The decision tree model provides 81% accuracy, 43% precision and 62% recall.
Some of the most important variables are MntMeatProducts, MntGoldProds, Income, MntWines and NumCatalogPurchases.
tree_over <- rpart(Response ~ ., data = train_balanced ,method = 'class')
rpart.plot(tree_over, extra = 106)
pred_tree_over <- predict(tree_over, newdata = test, type="class")
#Confusion matrix
cm_dt<-confusionMatrix(pred_tree_over, as.factor(test$Response), positive = "1")
print(cm_dt)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 473 39
#> 1 85 65
#>
#> Accuracy : 0.8127
#> 95% CI : (0.7808, 0.8417)
#> No Information Rate : 0.8429
#> P-Value [Acc > NIR] : 0.9841
#>
#> Kappa : 0.4006
#>
#> Mcnemar's Test P-Value : 0.0000532
#>
#> Sensitivity : 0.62500
#> Specificity : 0.84767
#> Pos Pred Value : 0.43333
#> Neg Pred Value : 0.92383
#> Prevalence : 0.15710
#> Detection Rate : 0.09819
#> Detection Prevalence : 0.22659
#> Balanced Accuracy : 0.73634
#>
#> 'Positive' Class : 1
#>
#variable importance
vtree <- varImp(tree_over)
VI_dt<- data.frame(var=names(train_balanced[,-25]), imp=vtree)
VI_plot_dt <- VI_dt[order(VI_dt$Overall, decreasing=FALSE),]
par(mar = c(2, 10, 4, 2) + 0.1)
barplot(VI_plot_dt$Overall,
names.arg=rownames(VI_plot_dt),
horiz=TRUE,
las = 1,
col='steelblue',
xlab='Variable Importance',
main="Variable importance of the Decision Tree",
las = 2,
cex.names = 0.65)
The random forest model provides 88% accuracy, 76% precision and 40% recall.
Some of the most important variables are: Recency, MntGoldProds, MntMeatProducts, MntWines, NumStorePurchases.
rf_default <- train(as.factor(Response)~.,data = train_balanced, method = "rf", metric = "Accuracy", importance=TRUE)
#predict
p1 <- predict(rf_default, test)
#accuracy
cm_rf <-confusionMatrix(p1, as.factor(test$Response), positive ="1")
print(cm_rf)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 545 62
#> 1 13 42
#>
#> Accuracy : 0.8867
#> 95% CI : (0.8601, 0.9098)
#> No Information Rate : 0.8429
#> P-Value [Acc > NIR] : 0.0007874
#>
#> Kappa : 0.4708
#>
#> Mcnemar's Test P-Value : 0.00000002981
#>
#> Sensitivity : 0.40385
#> Specificity : 0.97670
#> Pos Pred Value : 0.76364
#> Neg Pred Value : 0.89786
#> Prevalence : 0.15710
#> Detection Rate : 0.06344
#> Detection Prevalence : 0.08308
#> Balanced Accuracy : 0.69027
#>
#> 'Positive' Class : 1
#>
#variable importance
imp_rf <- varImp(rf_default)
VI_rf<- data.frame(var=names(train_balanced[,-25]), imp=imp_rf$importance)
VI_rf[,2] <- NULL
VI_plot_rf <- VI_rf[order(VI_rf$imp.1, decreasing= FALSE),]
par(mar = c(2, 10, 4, 2) + 0.1)
barplot(VI_plot_rf$imp.1,
names.arg=rownames(VI_plot_rf),
horiz=TRUE,
las = 1,
col='steelblue',
xlab='Variable Importance',
main="Variable importance of the Random Forest",
las = 2,
cex.names = 0.65)
The linear discriminant analysis (LDA) model provides 81% accuracy, 44% precision and 76% recall.
Some of the most important variables are: AcceptedCmp5, MntMeatProducts, MntWines, NumCatalogPurchases, AcceptedCmp1, Recency.
lda <- train(as.factor(Response) ~ .,method="lda", data = train_balanced)
pred_lda <- predict(lda, test)
#confusion matrix
cm_lda <- confusionMatrix(as.factor(pred_lda), as.factor(test$Response), positive="1")
print(cm_lda)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 457 24
#> 1 101 80
#>
#> Accuracy : 0.8112
#> 95% CI : (0.7792, 0.8403)
#> No Information Rate : 0.8429
#> P-Value [Acc > NIR] : 0.9877
#>
#> Kappa : 0.4521
#>
#> Mcnemar's Test P-Value : 0.00000000001063
#>
#> Sensitivity : 0.7692
#> Specificity : 0.8190
#> Pos Pred Value : 0.4420
#> Neg Pred Value : 0.9501
#> Prevalence : 0.1571
#> Detection Rate : 0.1208
#> Detection Prevalence : 0.2734
#> Balanced Accuracy : 0.7941
#>
#> 'Positive' Class : 1
#>
#Variable importance
vida <- varImp(lda)
VI_lda<- data.frame(var=names(train_balanced[,-25]), imp=vida$importance)
VI_plot_lda <- VI_lda[order(VI_lda$Overall, decreasing=FALSE),]
par(mar = c(2, 10, 4, 2) + 0.1)
barplot(VI_plot_lda$Overall,
names.arg=rownames(VI_plot_lda),
horiz=TRUE,
las = 1,
col='steelblue',
xlab='Variable Importance',
main="Variable Importance LDA",
las = 2,
cex.names = 0.65)
The best model is the logistic regression after PCA, with 96.8% accuracy, 97% recall and 84% precision. Since we are classifying an imbalanced data set where the positive class is the one of interest, precision is a particularly valuable metric.
comparecm<-cbind(cm_nb$byClass,cm_lr1$byClass,cm_pca$byClass,cm_dt$byClass,cm_rf$byClass, cm_lda$byClass)
comparecm2<-cbind(cm_nb$overall["Accuracy"],cm_lr1$overall["Accuracy"],cm_pca$overall["Accuracy"],cm_dt$overall["Accuracy"],cm_rf$overall["Accuracy"], cm_lda$overall["Accuracy"])
comparison <- data.frame(rbind(comparecm2,comparecm))
colnames(comparison) <- c("Naive Bayes", "Logistic Regression","Logistic Regression after PCA", "Decision Tree", "Random Forest", "LDA")
print(comparison)
#> Naive Bayes Logistic Regression
#> Accuracy 0.76888218 0.8066465
#> Sensitivity 0.57692308 0.7788462
#> Specificity 0.80465950 0.8118280
#> Pos Pred Value 0.35502959 0.4354839
#> Neg Pred Value 0.91075051 0.9516807
#> Precision 0.35502959 0.4354839
#> Recall 0.57692308 0.7788462
#> F1 0.43956044 0.5586207
#> Prevalence 0.15709970 0.1570997
#> Detection Rate 0.09063444 0.1223565
#> Detection Prevalence 0.25528701 0.2809668
#> Balanced Accuracy 0.69079129 0.7953371
#> Logistic Regression after PCA Decision Tree Random Forest
#> Accuracy 0.9682779 0.81268882 0.88670695
#> Sensitivity 0.9711538 0.62500000 0.40384615
#> Specificity 0.9677419 0.84767025 0.97670251
#> Pos Pred Value 0.8487395 0.43333333 0.76363636
#> Neg Pred Value 0.9944751 0.92382812 0.89785832
#> Precision 0.8487395 0.43333333 0.76363636
#> Recall 0.9711538 0.62500000 0.40384615
#> F1 0.9058296 0.51181102 0.52830189
#> Prevalence 0.1570997 0.15709970 0.15709970
#> Detection Rate 0.1525680 0.09818731 0.06344411
#> Detection Prevalence 0.1797583 0.22658610 0.08308157
#> Balanced Accuracy 0.9694479 0.73633513 0.69027433
#> LDA
#> Accuracy 0.8111782
#> Sensitivity 0.7692308
#> Specificity 0.8189964
#> Pos Pred Value 0.4419890
#> Neg Pred Value 0.9501040
#> Precision 0.4419890
#> Recall 0.7692308
#> F1 0.5614035
#> Prevalence 0.1570997
#> Detection Rate 0.1208459
#> Detection Prevalence 0.2734139
#> Balanced Accuracy 0.7941136
Regarding variable importance, the following table contains the most important variables considered by each model when predicting whether a campaign is accepted.
A variable is counted as important if the model assigned it a weight of at least 50% of the highest weight.
The important variables that recur most often are “Recency”, “MntMeatProducts” and “MntWines”.
Other important variables are “Income”,“Education”, “Marital_Status”, “Teenhome”, “NumCatalogPurchases”, “NumStorePurchases”, “NumWebVisitsMonth”, “AcceptedCmp3”, “AcceptedCmp5”.
variables <- cbind(vlr, vtree, imp_rf$importance["0"], vida$importance)
names(variables) <- c('Logistic regression', 'Decision tree', 'Random forest', 'LDA')
result_list <- list()
for (col_name in colnames(variables)) {
col_idx <- which(colnames(variables) == col_name)
max_value <- max(variables[, col_idx])
threshold <- 0.5 * max_value
rows_above_threshold <- rownames(variables)[variables[, col_idx] >= threshold]
result_list[[col_name]] <- rows_above_threshold
}
important_variables <- unlist(result_list)
variable_counts <- table(important_variables)
sorted_variable_counts <- sort(variable_counts, decreasing = TRUE)
print(sorted_variable_counts)
#> important_variables
#> Recency MntMeatProducts MntWines AcceptedCmp3
#> 4 3 3 2
#> AcceptedCmp5 Education Income Marital_Status
#> 2 2 2 2
#> MntFruits NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#> 2 2 2 2
#> Teenhome AcceptedCmp1 MntFishProducts MntGoldProds
#> 2 1 1 1
#> MntSweetProducts NumDealsPurchases NumWebPurchases Year_Birth
#> 1 1 1 1
Running PCA before the logistic regression produced a strong result: 96.8% accuracy, 97% recall and 84% precision. However, this model has the disadvantage that we cannot identify which variables matter most for predicting campaign acceptance. Since this knowledge may be crucial for the marketing department, we ran other models (decision tree, random forest and LDA) to obtain their variable importances and compare them. The most important variables are “Recency”, “MntMeatProducts” and “MntWines”.
The logistic regression after PCA should therefore be used to predict campaign acceptance, while the variable-importance information can also be used for marketing purposes.
Some aggregated variables were created:
#for this task we disregard campaign acceptance and focus on customer data and consumption habits.
ds4 <- select(ds3, -c(AcceptedCmp3,AcceptedCmp1, AcceptedCmp2, AcceptedCmp4, AcceptedCmp5, Complain, Response))
#we proceed to transform some of the variables for a more comprehensive analysis.
#Age as of 2023
ds4$age <- 2023 - ds4$Year_Birth
# Divide age in categories. The values were chosen arbitrarily.
ds4$category_age <- ifelse(ds4$age<=45,"Young",ifelse(ds4$age>=63,"Old","Adults"))
#Total number of kids per client, plus a dummy variable indicating whether the client has kids
ds4$quantity_of_kids <- ds4$Kidhome+ds4$Teenhome
ds4$have_kids <- ifelse(ds4$quantity_of_kids>0,1,0)
#Total amount spent by each client
ds4$total_amount_spent <- ds4$MntWines+ds4$MntFruits+ds4$MntMeatProducts+ds4$MntFishProducts+ds4$MntSweetProducts+ds4$MntGoldProds
#how many purchases each client makes
ds4$quantity_of_purchases <- ds4$NumWebPurchases+ds4$NumCatalogPurchases+ds4$NumStorePurchases
#We will drop some variables we used to transform data
ds4 <- select(ds4, -c(Year_Birth, Kidhome, Teenhome))
summary(ds4)
#> Education Marital_Status Income Recency
#> Min. :0.000 Min. :0.0000 Min. : 1.00 Min. : 0.00
#> 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 35.00 1st Qu.:24.00
#> Median :1.000 Median :1.0000 Median : 51.00 Median :49.00
#> Mean :1.446 Mean :0.6449 Mean : 51.12 Mean :49.01
#> 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 68.00 3rd Qu.:74.00
#> Max. :2.000 Max. :1.0000 Max. :113.00 Max. :99.00
#> MntWines MntFruits MntMeatProducts MntFishProducts
#> Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.00
#> 1st Qu.: 24.0 1st Qu.: 2.0 1st Qu.: 16.0 1st Qu.: 3.00
#> Median : 178.0 Median : 8.0 Median : 68.0 Median : 12.00
#> Mean : 306.2 Mean : 26.4 Mean : 165.3 Mean : 37.76
#> 3rd Qu.: 507.0 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00
#> Max. :1493.0 Max. :199.0 Max. :1725.0 Max. :259.00
#> MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
#> Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000
#> Median : 8.00 Median : 25.00 Median : 2.000 Median : 4.000
#> Mean : 27.13 Mean : 44.06 Mean : 2.318 Mean : 4.101
#> 3rd Qu.: 34.00 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000
#> Max. :262.00 Max. :321.00 Max. :15.000 Max. :27.000
#> NumCatalogPurchases NumStorePurchases NumWebVisitsMonth age
#> Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :27.0
#> 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.:46.0
#> Median : 2.000 Median : 5.000 Median : 6.000 Median :53.0
#> Mean : 2.645 Mean : 5.824 Mean : 5.337 Mean :54.1
#> 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000 3rd Qu.:64.0
#> Max. :28.000 Max. :13.000 Max. :20.000 Max. :83.0
#> category_age quantity_of_kids have_kids total_amount_spent
#> Length:2205 Min. :0.0000 Min. :0.0000 Min. : 5.0
#> Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 69.0
#> Mode :character Median :1.0000 Median :1.0000 Median : 397.0
#> Mean :0.9488 Mean :0.7152 Mean : 606.8
#> 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1047.0
#> Max. :3.0000 Max. :1.0000 Max. :2525.0
#> quantity_of_purchases
#> Min. : 0.00
#> 1st Qu.: 6.00
#> Median :12.00
#> Mean :12.57
#> 3rd Qu.:18.00
#> Max. :32.00
ggplot(ds4, aes(y= total_amount_spent, x= as.factor(have_kids), group = have_kids, fill = as.factor(have_kids)))+geom_boxplot()+labs(title ="Boxplot have kids") #no kids, more spent
ggplot(ds4, aes(y= total_amount_spent, x= as.factor(Marital_Status), group = Marital_Status, fill = as.factor(Marital_Status)))+geom_boxplot()+labs(title ="Boxplot marital status") #no effect
ggplot(ds4, aes(y= total_amount_spent, x= as.factor(Education), group = Education, fill = as.factor(Education)))+geom_boxplot()+labs(title ="Boxplot Education") #Graduates and PhDs spend more than non-graduates
ggplot(ds4, aes(y= total_amount_spent, x= category_age, group = category_age, fill = category_age))+geom_boxplot()+labs(title ="Boxplot age category") #the older the customers, the more they spend
#Income is continuous, so we bin it into quartiles before the boxplot
ggplot(ds4, aes(y= total_amount_spent, x= cut_number(Income, 4), fill = cut_number(Income, 4)))+geom_boxplot()+labs(title ="Boxplot income quartile") #the higher the income, the more they spend
ggplot(ds4, aes(y= quantity_of_purchases, x= as.factor(have_kids), group = have_kids, fill = as.factor(have_kids)))+geom_boxplot()+labs(title ="Boxplot have kids") #no kids, purchased more times.
ggplot(ds4, aes(y= quantity_of_purchases, x= as.factor(Marital_Status), group = Marital_Status, fill = as.factor(Marital_Status)))+geom_boxplot()+labs(title ="Boxplot Marital Status") #no effect
ggplot(ds4, aes(y= quantity_of_purchases, x= as.factor(Education), group = Education, fill = as.factor(Education)))+geom_boxplot()+labs(title ="Boxplot Education") #Graduates and PhDs purchase more often than non-graduates
ggplot(ds4, aes(y= quantity_of_purchases, x= category_age, group = category_age, fill = category_age))+geom_boxplot()+labs(title ="Boxplot Age category") #the older the customers, the more often they purchase
#as above, bin the continuous income into quartiles
ggplot(ds4, aes(y= quantity_of_purchases, x= cut_number(Income, 4), fill = cut_number(Income, 4)))+geom_boxplot()+labs(title ="Boxplot income quartile") #the higher the income, the more often they purchase
After this analysis, we decided to cluster on total amount spent and quantity of purchases, and to use the clients’ demographic information to profile the clusters.
ds5 <- select(ds4, c(total_amount_spent, quantity_of_purchases))
ds4$category_age <- as.numeric(as.factor(ds4$category_age)) #encode the age category numerically so it can be averaged per cluster
The elbow method suggests 2 to 3 clusters.
fviz_nbclust(ds5, kmeans, method = "wss")
The silhouette method also suggests 2 to 3 clusters.
fviz_nbclust(ds5, kmeans, method = "silhouette")
Clustering
set.seed(123) #k-means depends on random initialization, so we fix the seed for reproducibility
km <- kmeans(ds5, centers = 3)
#visualize clusters
fviz_cluster(km, data = ds5, xlab = "Total Amount Spent", ylab = "Quantity of Purchases") +
theme(plot.title = element_text(hjust = 0.5, size = 16))
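As an additional check on the choice k = 3, the average silhouette width of the final assignment can be computed directly (this assumes the cluster package is installed):

```r
library(cluster)

#silhouette widths for the k-means assignment; values closer to 1
#indicate well-separated clusters
sil <- silhouette(km$cluster, dist(ds5))
mean(sil[, "sil_width"])
```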
Mean values of each variable for each cluster:
#add column to dataset
ds4$kmean <- km$cluster
#cluster characterization
aggregate(ds4,list(ds4$kmean),mean)
#> Group.1 Education Marital_Status Income Recency MntWines MntFruits
#> 1 1 1.464789 0.6225352 76.80000 50.79437 833.44789 66.456338
#> 2 2 1.477537 0.6405990 65.22296 48.80865 473.58902 43.425957
#> 3 3 1.425941 0.6533227 37.04243 48.59808 75.73419 6.827862
#> MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds
#> 1 522.81408 95.35493 70.092958 81.67887
#> 2 228.91514 61.27121 43.717138 70.54908
#> 3 33.09528 10.07046 6.934347 20.61649
#> NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
#> 1 1.515493 5.552113 6.1690141 8.335211
#> 2 2.728785 6.093178 4.3777038 8.660566
#> 3 2.349079 2.729384 0.8102482 3.744596
#> NumWebVisitsMonth age category_age quantity_of_kids have_kids
#> 1 3.442254 54.80000 1.833803 0.2873239 0.2507042
#> 2 4.379368 56.68220 1.715474 0.7504160 0.6439268
#> 3 6.336269 52.65092 1.780624 1.2321857 0.8815052
#> total_amount_spent quantity_of_purchases kmean
#> 1 1669.8451 20.056338 1
#> 2 921.4676 19.131448 2
#> 3 153.2786 7.284227 3
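Because k-means assigns cluster numbers arbitrarily (they can change between runs), a safer way to label the segments is to rank the clusters by their mean spending instead of hard-coding the numbers. A sketch (the `segment` column is a new, hypothetical addition):

```r
#rank the clusters by mean total_amount_spent: 1 = lowest, 3 = highest
spend_rank <- rank(tapply(ds4$total_amount_spent, ds4$kmean, mean))
segment_names <- c("Low spenders", "Mid spenders", "High spenders")[spend_rank]

#segment label per customer, independent of the arbitrary cluster numbering
ds4$segment <- segment_names[ds4$kmean]
```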
Barplot: individuals per cluster
#barplot
ggplot(ds4, aes(x = factor(kmean))) + geom_bar() + labs(title = "Distribution of Clusters") + xlab("k-mean") +
scale_x_discrete(labels = c("3" = "Low spenders", "2" = "Mid spenders", "1" = "High spenders")) + theme(plot.title = element_text(hjust = 0.5, size = 16))
We ran k-means with 3 clusters, which are characterized as follows (cluster numbers match the mean table above).
Cluster 3, “Low spenders”: this group has the lowest income (~37k) and tends to have more than one kid. They spend the least in every category, chase deal purchases, and visit the website most often. They make few purchases, usually through the store or the web and rarely through the catalog.
Cluster 2, “Mid spenders”: this group has a medium income (~65k) and tends to have one kid. They spend a fair amount in every category, chase deal purchases, and visit the website often. They make many purchases through all channels, with a preference for the store.
Cluster 1, “High spenders”: this group has the highest income (~77k) and usually has no kids. They spend the most in every category, do not chase deal purchases, and rarely visit the website. They make many purchases through all channels, with a preference for the store.
For the linear regression we kept the same clusters and compared each cluster’s share of total spending: the “Low spenders” account for 15%, the “Mid spenders” for 41% and the “High spenders” for 44% of overall expenditure.
clust_high <- filter(ds4, ds4$kmean == 1)
clust_mid <- filter(ds4, ds4$kmean == 2)
clust_low <- filter(ds4, ds4$kmean == 3)
a <- sum(clust_low$total_amount_spent)  # 190,393
b <- sum(clust_mid$total_amount_spent)  # 553,802
c <- sum(clust_high$total_amount_spent) # 591,065
d <- a+b+c
#percentages of total amount spent by cluster
#~15% cluster_low
#~41% cluster_mid
#~44% cluster_high
For the “Low spenders”, education, having kids, and deal purchases reduce the total amount spent, while income, purchases through every channel, and website visits increase it.
model_clust_low <- lm(total_amount_spent~Education+Income+Marital_Status+have_kids+age+
Recency+NumDealsPurchases+NumWebPurchases+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth,
data=clust_low)
#keep only significant variables
model_clust_low_2 <-lm(total_amount_spent~Education+Income+have_kids+ NumDealsPurchases+NumWebPurchases+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth, data=clust_low)
summary(model_clust_low_2)
#>
#> Call:
#> lm(formula = total_amount_spent ~ Education + Income + have_kids +
#> NumDealsPurchases + NumWebPurchases + NumCatalogPurchases +
#> NumStorePurchases + NumWebVisitsMonth, data = clust_low)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -291.770 -19.079 -2.995 16.712 299.366
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -225.1477 6.8179 -33.023 < 0.0000000000000002 ***
#> Education -5.2097 1.7813 -2.925 0.003511 **
#> Income 1.0056 0.1032 9.748 < 0.0000000000000002 ***
#> have_kids -12.8495 3.3254 -3.864 0.000117 ***
#> NumDealsPurchases -1.8030 0.7511 -2.401 0.016515 *
#> NumWebPurchases 25.0731 0.5859 42.797 < 0.0000000000000002 ***
#> NumCatalogPurchases 51.0176 1.1862 43.010 < 0.0000000000000002 ***
#> NumStorePurchases 44.7064 0.8588 52.057 < 0.0000000000000002 ***
#> NumWebVisitsMonth 13.7287 0.6380 21.519 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 35.07 on 1240 degrees of freedom
#> Multiple R-squared: 0.943, Adjusted R-squared: 0.9426
#> F-statistic: 2564 on 8 and 1240 DF, p-value: < 0.00000000000000022
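Rather than eyeballing p-values, the reduced model can be compared with the full one through a partial F-test (this assumes both model objects fitted above are still in memory):

```r
#partial F-test: do the dropped predictors (Marital_Status, age, Recency)
#significantly improve the fit? A large p-value supports the reduced model.
anova(model_clust_low_2, model_clust_low)
```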
For the “Mid spenders”, income has a positive effect on the amount spent and having kids a negative one; the number of catalog purchases and monthly web visits also matter, while store purchases are not significant in the final model.
model_clust_mid <- lm(total_amount_spent~Education+Income+Marital_Status+have_kids+age+
Recency+NumDealsPurchases+NumWebPurchases+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth,
data=clust_mid)
#keep only significant variables
model_clust_mid_2 <-lm(total_amount_spent~Income+have_kids+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth,
data=clust_mid)
summary(model_clust_mid_2)
#>
#> Call:
#> lm(formula = total_amount_spent ~ Income + have_kids + NumCatalogPurchases +
#> NumStorePurchases + NumWebVisitsMonth, data = clust_mid)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -516.34 -189.03 -41.16 151.03 786.33
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 859.322 154.016 5.579 0.00000004847 ***
#> Income 9.711 1.601 6.065 0.00000000344 ***
#> have_kids -122.774 39.492 -3.109 0.00203 **
#> NumCatalogPurchases 10.529 5.566 1.892 0.05935 .
#> NumStorePurchases -6.023 4.896 -1.230 0.21949
#> NumWebVisitsMonth 23.453 9.027 2.598 0.00977 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 256.9 on 349 degrees of freedom
#> Multiple R-squared: 0.1356, Adjusted R-squared: 0.1232
#> F-statistic: 10.95 on 5 and 349 DF, p-value: 0.0000000008429
In the case of the “High spenders”, income significantly increases the amount spent, as does the number of monthly web visits (education is retained in the model but is not significant). Because this group visits the website least often and web visits are positively associated with spending, finding a way to increase their web visits could encourage them to spend more.
model_clust_high <- lm(total_amount_spent~Education+Income+Marital_Status+have_kids+age+Recency+NumDealsPurchases+
NumWebPurchases+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth, data=clust_high)
#keep only significant variables
model_clust_high_2 <-lm(total_amount_spent~Education+Income+have_kids+NumWebVisitsMonth, data=clust_high)
summary(model_clust_high_2)
#>
#> Call:
#> lm(formula = total_amount_spent ~ Education + Income + have_kids +
#> NumWebVisitsMonth, data = clust_high)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -602.81 -142.49 -0.11 151.95 427.10
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 458.156 75.617 6.059 0.0000000024302054 ***
#> Education 14.034 15.572 0.901 0.3678
#> Income 7.109 0.907 7.838 0.0000000000000212 ***
#> have_kids -117.380 19.392 -6.053 0.0000000025162304 ***
#> NumWebVisitsMonth 12.442 4.893 2.543 0.0112 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 190.6 on 596 degrees of freedom
#> Multiple R-squared: 0.195, Adjusted R-squared: 0.1896
#> F-statistic: 36.08 on 4 and 596 DF, p-value: < 0.00000000000000022
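Once fitted, the model can also be used for what-if estimates. The example below is purely illustrative, with a made-up customer profile (Income is in thousands, matching the scaled data above):

```r
#hypothetical high-spender profile: PhD-level education, 80k income,
#no kids, 4 web visits per month
new_client <- data.frame(Education = 2, Income = 80, have_kids = 0, NumWebVisitsMonth = 4)

#point prediction of total amount spent plus a 95% prediction interval
predict(model_clust_high_2, newdata = new_client, interval = "prediction")
```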